Homework 8

Author: Puri Rudick

Vectorize Reviews using TF-IDF
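
As an illustration of what this vectorization step computes, here is a minimal standard-library sketch of TF-IDF weighting. It is not the code used in this homework: the `tfidf_vectors` helper, the whitespace tokenizer, the smoothed IDF form, and the three sample reviews are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, rows); each row holds L2-normalized TF-IDF weights."""
    # Assumption: simple lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


# Hypothetical sample reviews, for illustration only.
reviews = ["great movie great cast", "boring movie", "great fun"]
vocab, X = tfidf_vectors(reviews)
for review, row in zip(reviews, X):
    print(review, [round(w, 3) for w in row])
```

Terms that occur often in one review but in few reviews overall receive the largest weights, which is what makes the vectors useful for clustering.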

In the elbow plot above, the distortion curve is close to a straight line, with only slight flattening around k = 6, 9, and 16, so these are the values of k used in this homework.
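
For reference, the distortion values behind an elbow plot can be computed as in the sketch below. It is a small standard-library version of Lloyd's k-means run on hypothetical 2-D points, not the homework's actual code or data; `kmeans`, `assign`, `sq_dist`, the seed, and the iteration count are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, distortion)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(points, centroids)
    # Distortion: sum of squared distances to the assigned centroid.
    distortion = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, distortion


# Hypothetical toy data: three loose groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
distortions = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, d in distortions.items():
    print(k, round(d, 2))
```

Plotting these values against k gives the elbow curve. On real review vectors the bend is often much less pronounced than on toy data like this, which is consistent with the nearly straight line described above.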

Create a function to fit a k-means model, print the 10 most common words in each cluster, and plot a word cloud for each cluster.
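
One way to obtain the top words of a cluster is to rank the vocabulary by the weights in that cluster's centroid. The sketch below assumes this ranking; the homework may instead have counted raw word frequencies. The `top_terms` helper, the vocabulary, and the centroid values are hypothetical.

```python
def top_terms(centroids, vocab, n=10):
    """Map each cluster index to its n highest-weighted vocabulary terms."""
    result = {}
    for cluster_id, centroid in enumerate(centroids):
        # Sort term indices by centroid weight, largest first.
        ranked = sorted(range(len(vocab)), key=lambda i: centroid[i], reverse=True)
        result[cluster_id] = [vocab[i] for i in ranked[:n]]
    return result


# Hypothetical vocabulary and centroid weights for two clusters.
vocab = ["acting", "boring", "funny", "plot"]
centroids = [
    [0.1, 0.7, 0.0, 0.2],
    [0.5, 0.0, 0.6, 0.3],
]
top = top_terms(centroids, vocab, n=2)
for cluster_id, terms in top.items():
    print(cluster_id, terms)
```

The same per-cluster term weights could also be used as the input frequencies for a word cloud.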

k = 6

k = 9

k = 16


Perform a vocabulary-based sentiment analysis of the movie reviews you used in homework 5 and homework 7, by doing the following:

  1. In Python, load one of the sentiment vocabularies referenced in the textbook, and run the sentiment analyzer as explained in the corresponding reference.
    Add words to the sentiment vocabulary, if you think you need to, to better fit your particular text collection.

  2. For each of the clusters you created in homework 7, compute the average, median, high, and low sentiment scores for each cluster. Explain whether you think this reveals anything interesting about the clusters.
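
The per-cluster summary asked for above can be sketched as follows. This is an illustration only: the four-word `LEXICON`, the `polarity` scorer (mean score of matched words), the sample texts, and the cluster labels are hypothetical stand-ins for a real sentiment vocabulary and the real cluster assignments.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical miniature sentiment vocabulary (word -> polarity).
LEXICON = {"great": 1.0, "good": 0.5, "boring": -0.5, "awful": -1.0}


def polarity(text):
    """Mean polarity of the lexicon words in the text; 0.0 if none match."""
    scores = [LEXICON[t] for t in text.lower().split() if t in LEXICON]
    return mean(scores) if scores else 0.0


def summarize(texts, labels):
    """Return mean, median, high, and low polarity for each cluster label."""
    by_cluster = defaultdict(list)
    for text, lab in zip(texts, labels):
        by_cluster[lab].append(polarity(text))
    return {
        lab: {"mean": mean(s), "median": median(s), "high": max(s), "low": min(s)}
        for lab, s in sorted(by_cluster.items())
    }


# Hypothetical reviews and cluster assignments.
texts = ["great good film", "boring plot", "awful boring mess", "good fun"]
labels = [0, 1, 1, 0]
summary = summarize(texts, labels)
for lab, stats in summary.items():
    print(lab, stats)
```

Reporting the median alongside the mean helps show whether a cluster's average is driven by a few extreme reviews.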

k = 6


For k=6, every cluster except Cluster #4 has a mean polarity score > 0, which implies positive sentiment. These clusters contain more positive terms, most notably Cluster #3, as the swarm plot above clearly shows.
Cluster #4 is the only cluster with a mean polarity score < 0 (though very close to 0), and its swarm plot shows no clear majority of positive or negative scores.

k = 9


For k=9, all clusters have a mean polarity score > 0. However, only Clusters #6 and #7 are strongly positive.

k = 16


For k=16, every cluster except Clusters #8, #11, and #13 has a mean polarity score > 0, which implies positive sentiment.
Clusters #6 and #10 are strongly positive.


  3. For extra credit, analyze sentiment of chunks as follows:
    • Take the chunks from homework 5, and in Python, run each chunk individually through your sentiment analyzer that you used in question 1. If the chunk registers a nonneutral sentiment, save it in a tabular format (the chunk, the sentiment score).
    • Now sort the table twice, once to show the highest negative-sentiment-scoring chunks at the top and again to show the highest positive-sentiment-scoring chunks at the top. Examine the upper portions of both sorted lists, to identify any trends, and explain what you see.
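
The two bullet points above can be sketched as a filter followed by two sorts. The lexicon, the `score` function (sum of matched word polarities), and the sample chunks are hypothetical; a real run would use the homework 5 chunks and the sentiment analyzer from question 1.

```python
# Hypothetical miniature sentiment vocabulary (word -> polarity).
LEXICON = {"great": 1.0, "good": 0.5, "boring": -0.5, "awful": -1.0}


def score(chunk):
    """Sum of the polarities of the lexicon words found in the chunk."""
    return sum(LEXICON.get(tok, 0.0) for tok in chunk.lower().split())


def nonneutral_table(chunks, score_fn):
    """Keep (chunk, score) rows whose score is not exactly neutral (0)."""
    rows = [(chunk, score_fn(chunk)) for chunk in chunks]
    return [(chunk, s) for chunk, s in rows if s != 0]


# Hypothetical chunks, for illustration only.
chunks = ["a great film", "the plot", "awful boring scenes", "good music"]
table = nonneutral_table(chunks, score)

# Sort once with the most negative chunks first, once with the most positive first.
most_negative = sorted(table, key=lambda row: row[1])[:10]
most_positive = sorted(table, key=lambda row: row[1], reverse=True)[:10]

print(most_negative)
print(most_positive)
```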

Top 10 Positive Sentiment Score


Looking at the top 10 positive sentiment scores, we can see that each review contains many individually positive words.

Top 10 Negative Sentiment Score


As with the top 10 positive sentiment scores, we can see that each of these reviews contains many individually negative words.